Timm Walz - Microsoft Consulting Services - tiwalz@microsoft.com
The annual “family-level” data files included here (available on the National Bureau of Economic Research (NBER) web site, www.nber.org) are the result of linking the four quarterly interviews for each Consumer Expenditure Survey (CE) respondent family. The processing also collapses the hundreds of spending, income, and wealth categories into a consistent set of categories across the years. Our hypothetical customer is a manufacturer of consumer goods, and we want to find a way to use this data to help them.
Propose one way of using this data employing one of the following methods:
Execute your proposal and discuss your methodology, justify your algorithm/feature selection, and share insights from the model. We're interested in understanding your approach as well as your ability to communicate any insights derived from your model to our hypothetical customer. The output should include a PowerPoint or PDF document. You will be expected to present your findings at a scheduled interview to follow.
%load_ext autoreload
%autoreload 2
%matplotlib inline
# Import standard packages
import pandas as pd
import logging
import seaborn as sns
import numpy as np
import sys
import matplotlib.pyplot as plt
sns.set(rc={'figure.figsize':(11,8)})
# Import custom functions, moved to a helper module to keep the code neat & clean :)
import helper
# Import supplementary price data
df_sup = pd.read_csv('../assets/Data_Supplementary_price.csv')
print(f"Sample of supplementary price data (total length {len(df_sup)})")
display(df_sup.sample(5))
# Import consumer expenditure data
df_exp = pd.read_csv('../assets/Data_consumer_expenditure_survey.csv').set_index('newid')
print(f"\nSample of consumer expenditure data (total length {len(df_exp)})")
display(df_exp.sample(5))
# Replace "\N" in 'nonwork' (reason for not working) with 0, assuming it means the person works
df_exp['work'] = df_exp['nonwork'].replace('\\N', 0).astype(int)
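The same `\N` sentinel may hide in other columns as well. Before trusting `.astype(int)` elsewhere, it can be worth auditing every column for it. A minimal sketch (the toy DataFrame below stands in for the real CE data; the helper name is ours):

```python
import pandas as pd

def count_sentinels(df, sentinel='\\N'):
    """Count occurrences of a placeholder string in every column,
    returning only the columns where it appears."""
    counts = (df.astype(str) == sentinel).sum()
    return counts[counts > 0]

# Toy example; the real CE columns may differ
df = pd.DataFrame({'nonwork': ['\\N', '2', '\\N'],
                   'age': ['34', '41', '29']})
print(count_sentinels(df))
```

Running this on `df_exp` before the replacement would show which columns still need sentinel handling.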
# Extract demographic information and assign it to a separate data frame
cols_dem = ['age', 'hhsize', 'num_child', 'blsurbn', 'income', 'educatio', 'race', 'sex', 'work', 'empstat', 'occup', 'emptype', 'marital']
df_exp_demographic = df_exp[cols_dem]
# Info about demographic data
df_exp_demographic.info()
First, we want to gain a rough overview and understanding of the underlying data set.
# Plot the supplementary data set to get an overview of how prices developed over the past years
df_sup_grouped = df_sup.groupby(['year']).mean().reset_index()
dfx = df_sup_grouped.drop(['quarter', 'date'], axis=1).melt('year', var_name='cols', value_name='price over year')
# Note: seaborn renamed factorplot to catplot (and size to height) in v0.9
g = sns.catplot(x="year", y="price over year", hue='cols', data=dfx, kind='point', height=6, aspect=2)
# Get plots of demographic data, function is imported from helper
helper.get_demographic_plots(df_exp)
# Select columns for expenditure analysis based on the supplement data
# They cover common consumer goods
consum_features = [
'alcohol',
'books',
'clothes',
'elect',
'food',
'foodhome',
'foodout',
'foodwork',
'gambling',
'gas',
'gasoline',
'hlthbeau',
'homefuel',
'homeval2',
'housuppl',
'jewelry',
'tailors',
'telephon',
'tobacco',
'utility',
'water']
# Extract the relevant information from the expenditure data frame
df_exp_values = df_exp[consum_features]
# Create boxplots to see the distribution of the values for every variable
helper.get_expenditure_boxplots(df_exp_values)
# Variables housuppl, foodwork, gambling, and homefuel appear to be empty or missing, so we remove them
df_exp_values = df_exp_values.drop(['housuppl', 'foodwork', 'gambling', 'homefuel'], axis=1)
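Rather than spotting empty variables by eye in the boxplots, the same check can be done programmatically: flag columns whose values are entirely zero or missing. A minimal sketch with illustrative column names (the helper is ours, not part of the analysis code):

```python
import pandas as pd

def find_empty_columns(df):
    """Return the columns that contain no informative value,
    i.e. every entry is zero or NaN."""
    return [col for col in df.columns
            if df[col].fillna(0).eq(0).all()]

# Illustrative data: 'gambling' carries no information, 'food' does
df = pd.DataFrame({'food': [120.0, 95.5, 210.0],
                   'gambling': [0.0, None, 0.0]})
print(find_empty_columns(df))  # → ['gambling']
```

Applied to `df_exp_values`, this would confirm (or extend) the list of columns dropped above.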
We have seen that there is quite a lot of noise in the data. Therefore, we want to remove the outliers by capping values at the 5th and 95th percentiles (winsorization).
df_exp_clean = df_exp_values.copy()
def remove_outliers(df_exp_clean):
    '''Cap outliers at the 5th and 95th percentiles (winsorization)'''
    for column in df_exp_clean.columns:
        lower, upper = df_exp_clean[column].quantile([0.05, 0.95]).values
        # clip avoids the chained-assignment warning of df[col][mask] = ...
        df_exp_clean[column] = df_exp_clean[column].clip(lower, upper)
    return df_exp_clean
# Run the function
df_exp_clean = remove_outliers(df_exp_clean)
# Run the boxplot function again based on the cleaned data
helper.get_expenditure_boxplots(df_exp_clean)
# Looks much cleaner now!
# Have a glance at the data
df_exp_clean.head()
There are quite a few variables in our data set, which can make it overwhelming and complex. We will reduce its dimensionality with Principal Component Analysis (PCA), which transforms a large set of variables into a smaller set of uncorrelated components that still retains most of the information of the original set.
# Import required packages
import plotly.express as px
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# Standardize the features so that all variables are on a comparable scale (zero mean, unit variance)
expenditures_scaled = StandardScaler().fit_transform(df_exp_clean.values)
# First, we plot the cumulative explained variance to see how many principal components are needed
# to reduce the dimensionality while retaining enough information for valuable and realistic insights
def get_explained_variance_plot(expenditures_scaled):
    '''Gets scaled expenditures and creates a plot for explained variance'''
    pca = PCA()
    pca.fit(expenditures_scaled)
    explained_variance = np.cumsum(pca.explained_variance_ratio_)
    fig = px.area(
        x=range(1, explained_variance.shape[0] + 1),
        y=explained_variance,
        labels={"x": "Number of Principal Components", "y": "Explained Variance"}
    )
    return fig
# Run the function
get_explained_variance_plot(expenditures_scaled)
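Instead of reading the number of components off the plot, the choice can also be made programmatically: pick the smallest n whose cumulative explained variance reaches a threshold. A sketch under assumptions (the function name is ours, and the synthetic matrix stands in for `expenditures_scaled`):

```python
import numpy as np
from sklearn.decomposition import PCA

def n_components_for(data, threshold=0.9):
    """Smallest number of components whose cumulative explained
    variance reaches the given threshold."""
    cumvar = np.cumsum(PCA().fit(data).explained_variance_ratio_)
    return int(np.searchsorted(cumvar, threshold) + 1)

# Synthetic standardized data as a stand-in for expenditures_scaled
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 17))
print(n_components_for(X, threshold=0.9))
```

On the real data, this would make the "10 components" choice below reproducible rather than visual.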
# We see that quite a lot of principal components are needed to describe the data set properly
# We choose 10 principal components, reducing the data set by 7 attributes
# Next, we create the principal components for a given n and plot
# how the data looks depending on age
def get_pca(expenditures_scaled, n_components):
    '''Get PCA and variance plot for specific n'''
    # Run PCA
    pca = PCA(n_components=n_components)
    components = pca.fit_transform(expenditures_scaled)
    total_var = pca.explained_variance_ratio_.sum() * 100
    labels = {str(i): f"PC {i+1}" for i in range(n_components)}
    labels['color'] = 'Age'
    # Plot
    fig = px.scatter_matrix(
        components,
        color=df_exp_demographic['age'],
        dimensions=range(n_components),
        labels=labels,
        title=f'Total Explained Variance: {total_var:.2f}%',
    )
    fig.update_traces(diagonal_visible=False)
    fig.show()
    return components
# Run function
components = get_pca(expenditures_scaled, n_components=10)
# We want to check whether the principal components correlate with each other, which ideally should not be the case
correlation_matrix = np.corrcoef(pd.DataFrame(components).transpose())
sns.heatmap(correlation_matrix, annot = True)
# The result looks quite dense due to the number of variables,
# but every off-diagonal cell is dark, i.e. the correlations are ~0
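The visual check of the heatmap can be backed by a numeric one: the largest absolute off-diagonal correlation should be close to zero. A small sketch (the helper name is ours; the random matrix below stands in for the component matrix):

```python
import numpy as np

def max_offdiag(corr):
    """Largest absolute off-diagonal entry of a correlation matrix."""
    c = np.asarray(corr, dtype=float).copy()
    np.fill_diagonal(c, 0.0)
    return float(np.abs(c).max())

# Toy check: correlations of independent columns should be near zero
rng = np.random.default_rng(1)
pcs = rng.normal(size=(1000, 10))
corr = np.corrcoef(pcs.T)
print(max_offdiag(corr))
```

Applied to `correlation_matrix` above, a value near zero confirms the components are uncorrelated, as PCA guarantees by construction.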
Now that we have reduced the dimensions of the data, we want to find clusters of consumers with similar spending behavior. This should help us find patterns in the data set.
We will use hierarchical clustering, as we do not know in advance how many clusters to expect. An alternative would be k-means clustering, which requires the number of clusters upfront. Based on the hierarchical distances illustrated by a dendrogram, we will decide later how many clusters to produce.
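As a cross-check on the cluster count chosen from the dendrogram, one could scan candidate k values with k-means and compare silhouette scores. A hedged sketch on synthetic data (`make_blobs` stands in for the principal-component matrix; on the real data one would pass `components` instead):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the principal-component matrix
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Fit k-means for a range of k and score each clustering
scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The k with the highest silhouette score is a reasonable candidate
best_k = max(scores, key=scores.get)
print(best_k)
```

Agreement between this score-based k and the dendrogram cut would strengthen the choice of 6 clusters below.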
# Import required packages
from scipy.cluster.hierarchy import linkage
from scipy.cluster.hierarchy import dendrogram
from scipy.cluster.hierarchy import cut_tree
%%time
# Apply linkage function using ward's method and create dendrogram
merged_data = linkage(pd.DataFrame(components), method='ward')
consumer_dend = dendrogram(merged_data)
# We cut the dendrogram at hierarchical distance of ~80 - so we extract 6 clusters
# Transform data set and attach principal components along with a cluster number
cluster_cut = pd.Series(cut_tree(merged_data, n_clusters = 6).reshape(-1,))
df_pca_cluster = pd.concat([pd.DataFrame(components), cluster_cut], axis=1)
df_pca_cluster.columns = [f'PC{index}' for index in range(1, len(components[0])+1)] + ["cluster_nr"]
df_pca_cluster.head()
# Cluster count
df_pca_cluster.cluster_nr.value_counts().plot(kind="bar")
# Merge demographic information to principal components and the cluster number
df_pca_cluster.index = df_exp_demographic.index
people_data = df_exp_demographic.join(df_pca_cluster)
people_data = people_data.join(df_exp_clean)
people_data.sample(5)
Now we are going to analyze the clusters created in the previous step and identify major differences between the groups. For this purpose, we merge the data with demographic information and take mean/median values for the attributes.
median_cols = ['educatio', 'age', 'race', 'empstat', 'occup', 'emptype', 'marital']
skip = ['emptype', 'empstat', 'occup', 'age_group']
# Next, we assemble a dataset based on the averaged consumer data and connect it with the demographic information
def assemble_cluster_report(people_data):
    '''Create cluster report based on principal components and average values for every variable'''
    cluster_summaries = pd.DataFrame()
    for column in cols_dem + list(people_data):
        if column in skip:
            continue
        elif column in median_cols:
            cluster_summaries[column] = people_data.groupby(["cluster_nr"])[column].median()
        else:
            cluster_summaries[column] = people_data.groupby(["cluster_nr"])[column].mean()
    return cluster_summaries
# Run function
cluster_summaries = assemble_cluster_report(people_data)
cluster_summaries
# We now define the columns we would like to plot in the next step for the spider plot
plot_clusters = [
'alcohol',
'books',
'clothes',
'elect',
'food',
'foodhome',
'foodout',
'gas',
'gasoline',
'hlthbeau',
'homeval2',
'jewelry',
'telephon',
'tobacco',
'utility',
'water'
]
# After assembling the cluster data set, we see what the typical average expenses look like for every cluster and variable
def get_cluster_barplots(people_data, plot_clusters):
    '''Create barplots for average variable values per cluster'''
    n_cols = 3
    n_rows = (len(plot_clusters) + n_cols - 1) // n_cols
    fig, axs = plt.subplots(n_rows, n_cols, figsize=(30, 30))
    for index, column in enumerate(plot_clusters):
        sns.barplot(x='cluster_nr', y=column, data=people_data,
                    ax=axs[index // n_cols, index % n_cols])
    return fig
# Run function
get_cluster_barplots(people_data, plot_clusters)
cluster_summaries[plot_clusters]
# Scale values to have a common scale (e.g. food is way higher than other values in comparison)
scaled = (cluster_summaries[plot_clusters] - cluster_summaries[plot_clusters].min())/(cluster_summaries[plot_clusters].max()-cluster_summaries[plot_clusters].min())
# Next, we create a scatter polar plot, also called a "spiderplot", to illustrate
# the dimensions of each variable per cluster. This lets us compare the consumer behavior in one place.
import plotly.graph_objects as go
def get_spiderplot(scaled, plot_clusters):
    '''Create spiderplot to illustrate expense dimensions'''
    fig = go.Figure()
    for index, row in scaled.iterrows():
        fig.add_trace(go.Scatterpolar(
            r=row.to_list(),
            theta=plot_clusters,
            fill='toself',
            name=f'Cluster {index}'  # trace names must be strings
        ))
    return fig
# Run function
get_spiderplot(scaled, plot_clusters)
For n_pca=10 and n_clusters=6
Cluster 0:
Cluster 1:
Cluster 2:
Cluster 3:
Cluster 4:
Cluster 5: